AI bioinformatics: avoiding command line hallucinations


In two previous articles we found that LLMs are very good at explaining complex bioinformatics command lines, but not good at identifying incorrect or meaningless arguments.

Now for the natural follow-up experiment: what tweaks can we make to the settings or prompts to increase the chances of catching such errors? We will try a few different approaches, and test each one with the same three command lines whose invented arguments went unflagged previously. As a reminder, they are:

  • a bwa mem command with a made-up --parallel option
  • a bwa mem command with a made-up --skip-errors option
  • a blastn command with a made-up --precise option

All experiments here are done with GPT5.2.

Turning on thinking mode

We will start with a very easy tweak and turn on thinking mode in the model options. This causes the model to output more reasoning tokens before responding.
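
For anyone running these checks through an API rather than a chat interface, the equivalent of this toggle is usually a reasoning or thinking parameter on the request. Below is a minimal sketch using the OpenAI Python client; the model name and the exact shape of the reasoning option are assumptions on my part and will differ between providers and versions.

    # Minimal sketch: ask for an explanation with extended reasoning enabled.
    # Assumes the OpenAI Python client; the model name and the shape of the
    # "reasoning" option are assumptions and vary by provider and version.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.responses.create(
        model="gpt-5",                 # placeholder model name
        reasoning={"effort": "high"},  # the "thinking mode" toggle
        input=(
            "Explain the options and arguments in this command line:\n\n"
            "blastn -query queries.fasta -db nt -remote --precise "
            '-outfmt "6 qseqid sseqid staxids pident length evalue bitscore" '
            "-max_target_seqs 5 -evalue 1e-6 -num_threads 8 -out results.tsv"
        ),
    )
    print(response.output_text)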

This simple tweak is surprisingly effective. It causes the model to correctly identify the problem with the first bwa command and flag it directly:

  • Important: standard bwa mem does not have a --parallel option.

For the second bwa command, thinking mode makes no difference - the model still fails to recognise that --skip-errors isn’t a valid option for bwa mem.

For the blastn command, thinking mode gives a partial success: the model still makes up an explanation, but we get the sense that it might not be exactly right:

  • --precise Requests a more precise/strict search mode.

    • Practical effect: typically aims to improve exactness/consistency of results (often at a speed cost).
    • Note: --precise is not a traditional “classic BLAST+” option in many installations; if your blastn errors with “unrecognized option”, it may be coming from a specific wrapper/build/version. If it works for you, keep it; if not, remove it.
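
The model’s hedged note points at the right follow-up: the tool’s own help text is the ground truth. On a machine with BLAST+ installed, a quick check along the lines of the sketch below settles the question (the exact wording of the help output varies between BLAST+ versions).

    # Minimal sketch: look for the questionable flag in blastn's own help text.
    # Assumes BLAST+ is installed and blastn is on PATH.
    import subprocess

    help_text = subprocess.run(
        ["blastn", "-help"], capture_output=True, text=True
    ).stdout

    if "precise" in help_text:
        print("blastn -help mentions something matching 'precise'")
    else:
        print("'precise' does not appear in blastn -help on this system")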


Making errors more prominent

Something that has emerged from this set of experiments is that there are really two questions we are interested in: does the model identify the incorrect argument, and does it flag it up in a human-readable way? Once we have thinking and web search turned on, the model is doing a pretty good job at the former. However, the information about the incorrect argument is not given any special place in the output, so it would be easy for a human reader to miss.

Let’s try modifying the prompt to explicitly check for incorrect arguments:

Explain the options and arguments in this command line:

blastn -query queries.fasta -db nt -remote --precise -outfmt "6 qseqid sseqid staxids pident length evalue bitscore" -max_target_seqs 5 -evalue 1e-6 -num_threads 8 -out results.tsv

include details on the roles of each file name. If any of the arguments are incorrect or invalid, report that first and then stop.

Given these explicit instructions, the model surfaces the error much better:

❗ Invalid Option

Before explaining any of the arguments, there is one invalid option in the command:

  • --precise — This is not a valid BLAST+ option in blastn. …

🧠 Key Takeaway

Everything in the command is valid except --precise, which should not be included. Remove that flag and the command should run (given access to the database and connectivity to the remote server). ([NCBI][1])

The error is also given much more prominence in the response when we ask about the bwa mem command with the --skip-errors option:

The option --skip-errors is not a valid BWA-MEM option. I could not find any evidence that bwa mem supports --skip-errors in its documented option list — all standard options use a single dash (e.g., -t, -R, -M) and there is no built-in --skip-errors documented. ([Debian Manpages][1])

Invalid Argument

  • --skip-errors is not a recognized BWA-MEM option. — You must remove it for the command to run.
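
If you want to run this check on other command lines, the modified prompt is easy to template. The sketch below wraps it in a hypothetical helper; the function name, model name and client usage are my own choices rather than part of the experiment above.

    # Hypothetical helper that wraps the "report invalid arguments first" prompt.
    # Assumes the OpenAI Python client; the model name is a placeholder.
    from openai import OpenAI

    client = OpenAI()

    PROMPT_TEMPLATE = (
        "Explain the options and arguments in this command line:\n\n"
        "{command}\n\n"
        "include details on the roles of each file name. If any of the arguments "
        "are incorrect or invalid, report that first and then stop."
    )

    def explain_command(command: str) -> str:
        response = client.responses.create(
            model="gpt-5",                 # placeholder model name
            reasoning={"effort": "high"},  # keep thinking mode switched on
            input=PROMPT_TEMPLATE.format(command=command),
        )
        return response.output_text

    # example usage with any command line you want checked
    print(explain_command("bwa mem -t 8 reference.fa reads_1.fq reads_2.fq"))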

Summary

When using LLMs for the task of explaining complex command lines - something that they generally excel at! - we can increase the chance of catching errors by turning on both thinking and web search. We can also increase the prominence of error reporting by tweaking the prompt to make it explicit.

These are probably best practices for using AI in this way, especially since this set of examples used two very widely discussed tools. More obscure tools are less likely to be well represented in the training set, and so errors in their command lines would be even harder to catch.

If you are interested in doing your own experiments with web search, remember to use example commands different from mine, as this set of articles will show up easily in web searches if you use the same incorrect arguments!

One thing that we haven’t tried so far is to use one of the agentic models that can run the command line and load the output into context. That will be an interesting experiment for the future, as it won’t rely either on knowledge directly embedded in the model or on context available from a web search.
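
In the meantime, a rough approximation of that idea can be scripted by hand: run the command yourself, capture whatever it prints, and pass that back to the model along with the original question, so the answer no longer depends solely on training data or search results. A sketch is below; the example command, model name and client usage are all my own, and blindly executing untrusted command lines like this would of course need sandboxing in any real setup.

    # Rough sketch of the "let the tool speak for itself" idea: run the command,
    # capture its output and error text, and include them in the prompt.
    # The command, model name and client usage are illustrative assumptions.
    import subprocess
    from openai import OpenAI

    client = OpenAI()

    command = "bwa mem --skip-errors reference.fa reads.fq"  # hypothetical example

    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=60
    )

    prompt = (
        "Explain the options and arguments in this command line, and flag any "
        f"that are invalid:\n\n{command}\n\n"
        f"Running it returned exit code {result.returncode} with this output:\n\n"
        f"{(result.stdout + result.stderr)[:2000]}"
    )

    response = client.responses.create(model="gpt-5", input=prompt)
    print(response.output_text)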